Efficient end-to-end long-read sequence mapping using minimap2-fpga integrated with hardware accelerated chaining

minimap2 is the gold-standard software for reference-based sequence mapping in third-generation long-read sequencing. While minimap2 is relatively fast, further speedup is desirable, especially when processing a multitude of large datasets. In this work, we present minimap2-fpga, a hardware-accelerated version of minimap2 that speeds up the mapping process by integrating an FPGA kernel optimised for chaining. Integrating the FPGA kernel into minimap2 posed significant challenges that we solved by accurately predicting the processing time on hardware while considering data transfer overheads, mitigating hardware scheduling overheads in a multi-threaded environment, and optimizing memory management for processing large realistic datasets. We demonstrate speed-ups in end-to-end run-time for data from both Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio). minimap2-fpga is up to 79% and 53% faster than minimap2 for \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sim 30\times $$\end{document}∼30× ONT and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sim 50\times $$\end{document}∼50× PacBio datasets respectively, when mapping without base-level alignment. When mapping with base-level alignment, minimap2-fpga is up to 62% and 10% faster than minimap2 for \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sim 30\times $$\end{document}∼30× ONT and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sim 50\times $$\end{document}∼50× PacBio datasets respectively. The accuracy is near-identical to that of original minimap2 for both ONT and PacBio data, when mapping both with and without base-level alignment. minimap2-fpga is supported on Intel FPGA-based systems (evaluations performed on an on-premise system) and Xilinx FPGA-based systems (evaluations performed on a cloud system). We also provide a well-documented library for the FPGA-accelerated chaining kernel to be used by future researchers developing sequence alignment software with limited hardware background.


FPGA-based Chaining Step Accelerator
The FPGA-based hardware accelerator architecture in our previous work 6 was extended as shown in Figure S2.With the modifications to the hardware accelerator, it was able to achieve better accuracy (compare Supplementary Figure S1 vs. Figure 1e,1f) in the chaining output while keeping the FPGA resource usage low (so that the accelerator could fit in the limited FPGA area available to use).Furthermore, we successfully maintained the desired speed-up factor throughout the optimization process.Algorithm 1 describes the HLS algorithm used to generate the modified hardware architecture in Figure S2.
minimap2's software chaining algorithm (Algorithm 2) computes H (at most) chaining scores for each anchor, using previous H (at most) anchors (for loop in lines 17-23 of Algorithm 2).In our previous hardware accelerator implementation 6 , we parallelized this computation by utilizing H score computation units to perform the computations in parallel.Considering the FPGA resource limitations and minimap2's original algorithm definitions, H was set to 64.However, to achieve a mapping accuracy similar to software implementation of minimap2, H needed to be increased.Increasing the number of parallel score computation units (H) directly to achieve better accuracy, resulted in a design that couldn't fit in the FPGA being used.To resolve this resource utilization issue, the hardware architecture is modified so that the H score computations corresponding to a single anchor are performed in M sub-parts (see Figure S2 and Algorithm 1).Score computations in a sub-part are performed in parallel with only P score computation units (note that previous accelerator 6 needed H score computation units).As most recently used H anchors (A[i]) and the maximum score values corresponding to them (F[i]), are needed for subsequent score computations, H = M × P anchors and their maximum score values are stored in first-in, first-out (FIFO) buffers implemented with hardware shift registers.
The modified hardware accelerator with additional control logic was optimized to have an Initiation Interval (II) of 3, making it able to process a sub-part every 3 clock cycles.

Pipelined and unrolled inner loop computation
Subpart Kernel 1 Kernel N

Device DRAM
A The architecture of the FPGA-based hardware accelerator designed for minimap2's chaining step acceleration The hardware architecture of our previous hardware accelerator 6 was modified to achieve better accuracy while preserving the speed-up when compared to original minimap2 software.
the counter used to denote the current anchor being processed 5: subparts_processed = 0 ▷ Initialize a counter to keep track of already processed sub-parts for current anchor 6: subparts_to_process = 0 ▷ Initialize a variable to store the total number of sub-parts for current anchor 7: for g = 0; g < total_subparts; g++ do ▷ Loop over all sub-parts of the chaining task 8: ▷ If all the sub-parts for previous anchor are processed, load next anchor (into FIFO A[]) and its sub-part count 9: if subparts_processed == 0 then 10: ▷ Initialize best score (max_f ) and predecessor (max_j) for the current sub-part being processed ▷ Update best score and predecessor for the anchor i if max_f found from the current sub-part is greater than the best score found so far for the anchor (which is stored in subparts_processed++ ▷ Increment the processed sub-parts counter 34: ▷ Shift locally stored anchors and chaining scores in the FIFO buffers if all the sub-parts for the current anchor are processed A novel hardware-software split mechanism was introduced to process the chaining tasks either on hardware accelerator or the CPU (as software) based on their computation times predicted with theoretical models.Equation 1 and Equation 2 give the models derived to predict the time taken for a given chaining task to run on hardware accelerator and on CPU (as software) respectively.
The time taken to process a given chaining task on hardware (T hardware , given in Equation 1) is the sum of the times taken for data transfer (T data_transfer = input transfer from host to the hardware accelerator + output results transfer from hardware accelerator to the host) and the execution time on the hardware accelerator (T execution ).The data transfer time (T data_transfer ) is proportional to the number of anchors being transferred (n).As the output for a single subpart is generated by the accelerator every II clock cycles, the execution time in terms of clock cycles is II × total_subparts.Multiplying the clock cycles by the clock period (T clock ) gives the execution time in seconds (T execution ).The constant C 1 accounts for the initial pipeline delay and the overhead time taken for OpenCL API calls by the host.
To derive the relationship between the time taken to process a given chaining task on the CPU as software (T so f tware , as given in Equation 2), it is necessary to refer to the original software chaining algorithm.This algorithm is presented in Algorithm 2 (also presented in our previous work 6 ).The number of computational steps can be estimated by the total number of iterations (i.e. total trip count) taken by the for loop in Lines 17-23 of Algorithm 2 and is represented by Computing the total trip count corresponding to a single anchor can be done prior to the real execution of the for loop by using the while loop in Lines 13-15 of Algorithm 2 (this loop computes the starting index (st) used in the for loop in Lines 17-23 and min(i − H, st) gives the trip count corresponding to the anchor).Summing up the trip count values corresponding to all the anchors gives n ∑ i=1 trip_count i in Equation 2. C 2 in Equation 2 accounts for the overhead setup time for the for loop in Lines 17-23 of Algorithm 2.
Based on a one-time experiment done with a representative dataset, the values for the coefficients in Equation 1and 2 (i.e.K 1 , K 2 , C 1 , C 2 ) are found.After, T hardware and T so f tware are calculated for each chaining task at the run time.If T hardware < T so f tware , the chaining task is decided to be processed on hardware (see Supplementary Note 3 for hardware scheduling details), otherwise, the chaining task is processed on the host CPU itself as software.
Algorithm 2 minimap2's Chaining Algorithm ▷ Store maximum score and best predecessor for anchor i 25: p[i] = max_j 27: end for 3 Hardware-Software Integration The FPGA-based hardware accelerator is carefully integrated to the minimap2 original software while minimally affecting the multi-threaded behavior of original software so that an efficient reduction in tool's end-to-end runtime can be achieved.OpenCL API functions are used for data transfer and communication between the host CPU (software) and the FPGA-based accelerator (hardware).
minimap2 processes DNA reads on software with multiple threads by using a fork-join thread model.This allows minimap2 to parallelize the computations being performed while maximally utilizing the multiple cores available in the CPU.When a chaining request is issued by any of the software threads of hardware-software integrated version of minimap2 (i.e.mm_chain_dp function is called by a software thread), it is decided whether the requested chaining task should be processed on hardware or software with the hardware-software split mechanism discussed in Supplementary Note 2. If the chaining task is decided to be processed on software, the associated software thread continues to execute the chaining task on the CPU itself.
Since the FPGA devices are resource-constrained, the number of hardware kernels (N in Fig. S2) that can be configured on the device is usually lower than the number of software threads available on a typical high performance computing (HPC) system.Therefore, when the chaining task is decided to be processed on hardware, each software thread in the host CPU tries to schedule the chaining computation on one of the N hardware kernels configured on the FPGA device as given in Algorithm 3.
In lines 1-18 of Algorithm 3, given the predicted hardware and software execution times, the algorithm tries to find a hardware kernel on which the total time to finish processing the chaining task (total_hw_time_pred, which is the sum of the predicted hardware processing time (hw_time_pred) and the waiting time to access that kernel (wait_time)), is smaller than the predicted software execution time (sw_time_pred).If total_hw_time_pred is greater than or equal to sw_time_pred, the associated software thread executes the chaining task on the CPU itself.If total_hw_time_pred is smaller than sw_time_pred, the hardware kernel which satisfied the condition is recorded (kernel_id) and the chaining task is inserted into a first-in, first-out (FIFO) queue corresponding to the hardware kernel (hw_queue[kernel_id]), so that the CPU can wait and execute the task on the hardware kernel when the chaining task comes first in the queue.Lines 20-33 of Algorithm 3 correspond to the section where it takes the chaining task out from the queue and processes it on the previously recorded hardware kernel (kernel_id) when it is ready to do so.As this algorithm is performed on multiple software threads running in parallel and the N hardware kernels and the N hardware queues are shared among the multiple threads, necessary synchronization mechanisms (highlighted in blue color in Algorithm 3) are implemented with two mutex locks for each hardware kernel -one for accessing/modifying the queue corresponding to the kernel (called "queue lock") and one for accessing the kernel for chaining task execution (called "execution lock").

Performance Benchmarking
The commands used for the performance benchmarking detailed under "Performance Comparison" are given below.The same commands were used both on Intel FPGA-based system and the Xilinx FPGA-based system.

Accuracy Benchmarking
The commands used for generating the simulated reads used in "Accuracy Evaluation" are given below.

Figure S1 .
Figure S1.Accuracy of minimap2-fpga when combined with previous hardware accelerator 6 on the Intel FPGA-based system (a) Accuracy comparison performed with simulated ONT reads for original minimap2 vs. minimap2-fpga with base-level alignment.(b) Accuracy comparison performed with simulated ONT reads for original minimap2 vs. minimap2-fpga without base-level alignment.